Scott the Scientist did an analysis of RNA-Seq data from
the development of the mouse cortex at days -8,
-4, 0 , 1, 7,
16, 21, and 26 taken from the
Allen Brain Atlas Developing Mouse Portal http://developingmouse.brain-map.org/. This type of data
is know as time series data since the features are taken through
time.
Scott begins by reading in the data and preparing the data frame. The columns are days at which the samples are collected where DayNegN, Day0, and DayPosN means N days before birth, day of birth, and N days after bith. The entries in the columns are the amount of RNA for each gene detected on that day in the mouse embryo cerebral cortex. We then change the columns’ name to their actual meaning and create a matrix for future analysis.
# read data as data frame
Mouse.df <-read.csv("datasets/MouseHomologData.csv", row.names = 1)
head(Mouse.df) # preview of the data frame
## DayNeg8 DayNeg4 Day0 DayPos1 DayPos7 DayPos16
## A1BG -0.54006172 1.62018517 1.62018517 -0.54006172 -0.5400617 -0.54006172
## A4GALT 0.77955207 -0.01579633 1.82626020 0.64704748 -0.6928331 -0.84321867
## AAAS 2.02780897 0.97337784 0.06271547 -0.47846790 -0.4818574 -0.70894012
## AADAT 0.05395975 -1.45584988 0.62811270 -0.77537230 1.9295261 0.05395975
## AAMP -0.58236089 -1.82323478 0.59797889 1.49446830 0.1356618 -0.56373317
## AANAT -0.44728620 -0.64756360 -0.11349053 -0.04673139 2.3966529 -0.04673139
## DayPos21 DayPos28
## A1BG -0.54006172 -0.5400617
## A4GALT -0.86458745 -0.8364241
## AAAS -0.71097391 -0.6836630
## AADAT -0.04173241 -0.3926037
## AAMP 0.11103932 0.6301806
## AANAT -0.64756360 -0.4472862
# change column name
colnames(Mouse.df)<-c("-8", "-4", "0", "1", "7", "16", "21", "28")
head(Mouse.df) # preview of the data frame
## -8 -4 0 1 7 16
## A1BG -0.54006172 1.62018517 1.62018517 -0.54006172 -0.5400617 -0.54006172
## A4GALT 0.77955207 -0.01579633 1.82626020 0.64704748 -0.6928331 -0.84321867
## AAAS 2.02780897 0.97337784 0.06271547 -0.47846790 -0.4818574 -0.70894012
## AADAT 0.05395975 -1.45584988 0.62811270 -0.77537230 1.9295261 0.05395975
## AAMP -0.58236089 -1.82323478 0.59797889 1.49446830 0.1356618 -0.56373317
## AANAT -0.44728620 -0.64756360 -0.11349053 -0.04673139 2.3966529 -0.04673139
## 21 28
## A1BG -0.54006172 -0.5400617
## A4GALT -0.86458745 -0.8364241
## AAAS -0.71097391 -0.6836630
## AADAT -0.04173241 -0.3926037
## AAMP 0.11103932 0.6301806
## AANAT -0.64756360 -0.4472862
# create matrix
Mouse.matrix <- as.matrix(Mouse.df)
Note that the data already been scaled so that the analysis can focus on the shape of the time series rather than specific magnitudes. We can confirm the scaling is successful by calculating the row means and making sure their norm was near 0.
norm(rowMeans(Mouse.matrix))
## [1] 6.127969e-08
We used K-means clustering to create five clusters based on domain knowledge: biologists believe there are five stages of brain development, so we select five clusters.
set.seed(300)
km <-kmeans(Mouse.matrix, 5)
Examining the heatmap of the K-means cluster centers, we can see that each cluster corresponds to different average peaks of gene expressions.
heatmap.2(
x = km$centers, Colv = FALSE,
dendrogram = "none", trace ="none",
main = "K-means Cluster Centers",
)
We can also see the time trends in the clusters means by plotting each
cluster mean as a line. The cluster means have to be reformatted into a
data frame with columns
Cluster, Day and
Mean. This is done using the dplyr package
gather() command.
# cluster means
tics <- c(-8,-4,0,1,7,16,21,28) # x-axis "tics" (for the plot)
clustermean.df <- as.data.frame(km$centers, row.names = c("1","2","3","4","5"))
clustermean.df
## -8 -4 0 1 7 16 21
## 1 -0.4496261 -0.6106791 1.3974736 1.1913655 0.03916353 -0.5570030 -0.5703216
## 2 1.9712955 0.4686750 -0.1298583 -0.4293038 -0.27156219 -0.5828320 -0.5422287
## 3 -0.7822509 -1.0207248 -0.6142838 -0.5672292 0.28858510 0.6305905 0.8961737
## 4 1.0614302 0.5152264 1.1091815 0.2865445 -0.28002391 -0.9195562 -0.9137313
## 5 -0.9336692 -1.1442140 -0.1620285 0.3392103 1.44873958 0.3088564 0.1023566
## 28
## 1 -0.44037268
## 2 -0.48418548
## 3 1.16913936
## 4 -0.85907127
## 5 0.04074883
# reformatted
clustermean.df <- clustermean.df %>%
rownames_to_column("Cluster") %>% # Make a new column called Cluster
gather(key="Day",value="Mean", -Cluster) %>%
convert(int(Day)) # convert Day to an integer
head(clustermean.df)
## # A tibble: 6 × 3
## Cluster Day Mean
## <chr> <int> <dbl>
## 1 1 -8 -0.450
## 2 2 -8 1.97
## 3 3 -8 -0.782
## 4 4 -8 1.06
## 5 5 -8 -0.934
## 6 1 -4 -0.611
We use the reformatted data frame and ggplot with
geom_line() to make the plots. Note our use of
facet_wrap() to generate individual plots by
Cluster.
ggplot(clustermean.df,aes(x=Day, y=Mean, col=Cluster)) +
geom_line() + geom_point() +
scale_x_continuous(breaks=tics) +
labs(title="Cluster Centers") +
facet_wrap(Cluster ~.) # Use facet_wrap to make a separate plot for each cluster
Here we make separate heat map for each of the five clusters. Plot the
cluster heatmaps in the order they occur in development. Note
that all the heatmaps are on the same scales and that they are cluster
on the genes but not the days.
heatmap.2(Mouse.matrix[km$cluster==1,],
scale = "none",
dendrogram = "row",
Colv=FALSE,
main = "Mouse Genes Cluster 1",
labRow= NA,
trace ="none")
heatmap.2(Mouse.matrix[km$cluster==2,],
scale = "none",
dendrogram = "row",
Colv=FALSE,
main = "Mouse GenesCluster 2",
labRow= NA,
trace ="none")
heatmap.2(Mouse.matrix[km$cluster==3,],
scale = "none",
dendrogram = "row",
Colv=FALSE,
main = "Mouse Genes Cluster 3",
labRow= NA,
trace ="none")
heatmap.2(Mouse.matrix[km$cluster==4,],
scale = "none",
dendrogram = "row",
Colv=FALSE,
main = "Mouse Genes Cluster 4",
labRow= NA,
trace ="none")
heatmap.2(Mouse.matrix[km$cluster==5,],
scale = "none",
dendrogram = "row",
Colv=FALSE,
main = "Mouse Genes Cluster 5",
labRow= NA,
trace ="none")
We visualize the cluster using a biplot of two components generated by PCA which explain 74% of the variance.
# Calculate the PCA
my.pca <- prcomp(Mouse.matrix, retx=TRUE, center=TRUE, scale=TRUE)
# Summarize, to see the complete PCA result
summary(my.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0364 1.3306 0.9370 0.66801 0.59296 0.54132 0.33712
## Proportion of Variance 0.5184 0.2213 0.1098 0.05578 0.04395 0.03663 0.01421
## Cumulative Proportion 0.5184 0.7397 0.8494 0.90521 0.94917 0.98579 1.00000
## PC8
## Standard deviation 3.306e-11
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
We display the points in a “biplot.” The projection of the points makes an interesting disc-type shape which is less dense in the middle, like a donut.
We can see that the clusters are arranged in time order around the disc. Why do we see this?
# Calculate x and y scale limits for the biplot
t<-1.2*max(abs(my.pca$x[,1:2]))
# Generate the biplot using ggbiplot
p <- ggbiplot(my.pca,
choices=c(1,2), # Use PC1, PC2
alpha=.1, # Make dots transparent
varname.adjust=1.5, # Move variables names out a bit
scale =0, # Don't rescale data
groups=as.factor(km$cluster)) +
ggtitle('Mouse Biplot for PC1 and PC2') + xlim(-t,t) + ylim(-t,t) # title plot and make square
p
We hypothesize that each cluster of genes represents one of the five
“stages” of brain development labeled A,B,C,D,E. Assign each cluster to
its corresponding stage using the biplot. For example, Cluster 5
represents Stage D of development since it peaks at the first at the
point Day7.
Examine the scalar projections of the coordinate axes in the biplot,
for Days -8, -4, 0,
1, 7, 16, 21 and
28. Notice that the coordinate vectors act as hours on a
“developmental time clock” that starts at Day -8. Does
time on this development clock run clockwise or
counterclockwise?
You’ve now completed in class Prelab5! Go to LMS and complete the online quiz.
Using mouse gene and cluster technique, we find five stages for the mouse gene. Then, by the known human gene in some diseases, we intersect with mouse gene and stage we just find to find out which stage is enriched. Then, that stage is the most susceptible stage for the mouse of that disease. Then, we compare to the result of human that we already know.
From the prelab 5, we have that the order of the cluster is 2, 4, 1, 5, 3. Thus, this time, after k-means, we set cluster 2, 4, 1, 5, 3 as A, B, C, D, E in order.
# get sample data
Mouse.df <- read.csv("datasets/MouseHomologData.csv", row.names = 1)
colnames(Mouse.df)<-c("-8","-4","0","1","7","16","21","28")
Mouse.matrix <- as.matrix(Mouse.df)
# k-means
set.seed(300)
km <-kmeans(Mouse.matrix, 5)
rownames(km$centers) <- c('C', 'A', 'E', 'B', 'D')
Then, we do a PCA and provide the biplot using PC1 and PC2, colored according to the k-means cluster result.
# PCA
my.pca <- prcomp(Mouse.matrix, retx=TRUE, center=TRUE, scale=TRUE)
a.matrix <- km$cluster
# biplot using ggbiplot
label <- as.factor(km$cluster)
levels(label) <- c('C', 'A', 'E', 'B', 'D')
p <- ggbiplot(my.pca,
choices = c(1,2),
alpha = 0.1,
varname.adjust = 1.5, # Move variables names out a bit
scale = 0,
groups=label) +
ggtitle('Mouse Biplot for PC1 and PC2')
p
Next, we create a heat map that visualizes the result, with the order from cluster ‘A’ to ‘E.’ It’s easy to see a shift of higher value from cluster ‘A’ to ‘E.’
set.seed(300)
heatmap.2(km$centers,
scale = "none",
dendrogram = "none",
Colv=FALSE,
cexCol=1.0,
main = "Kmeans Cluster Centers",
trace ="none")
heatmap.2(Mouse.matrix[km$cluster==2,],
scale = "none",
dendrogram = "row",
Colv=FALSE,
main = "Mouse Genes Cluster A",
cexCol=1.5,
labRow= NA,
trace ="none")
heatmap.2(Mouse.matrix[km$cluster==4,],
scale = "none",
dendrogram = "row",
Colv=FALSE,
main = "Mouse GenesCluster B",
cexCol=1.5,
labRow= NA,
trace ="none")
heatmap.2(Mouse.matrix[km$cluster==1,],
scale = "none",
dendrogram = "row",
Colv=FALSE,
main = "Mouse Genes Cluster C",
cexCol=1.5,
labRow= NA,
trace ="none")
heatmap.2(Mouse.matrix[km$cluster==5,],
scale = "none",
dendrogram = "row",
Colv=FALSE,
main = "Mouse Genes Cluster D",
cexCol=1.5,
labRow= NA,
trace ="none")
heatmap.2(Mouse.matrix[km$cluster==3,],
scale = "none",
dendrogram = "row",
Colv=FALSE,
main = "Mouse Genes Cluster E",
cexCol=1.5,
labRow= NA,
trace ="none")
First we get the data point of Zika, and then make the biplot using these point. The PCA and cluster of data is from the mouse data
# get sample data
disease.df <- read.csv("datasets/Zikamicrocephaly_data.csv",row.names = 1)
disease_symbols <- intersect(as.character(disease.df$symbol), as.character(rownames(Mouse.df)))
label <- as.factor(km$cluster)
levels(label) <- c('C', 'A', 'E', 'B', 'D')
plot.df <- cbind.data.frame(my.pca$x, cluster=label)
myplot.df <- plot.df[disease_symbols,]
# biplot
ggplot()+
geom_point(data = myplot.df,
aes(x=PC1, y=PC2, color=cluster) )+
geom_segment(x=0, y=0,
aes(xend=3*my.pca$rotation[,1], yend=3*my.pca$rotation[,2]),
arrow=arrow(length=unit(1/2,'picas')) )+
geom_text(aes(x=3.3*my.pca$rotation[,1],
y=3.3*my.pca$rotation[,2],
label=c("-8","-4","0","1","7","16","21","28")),
size=4)
Def “cluster_pvals” to calculate pvalue and logodds of each cluster. The code come from the Lab5 handout from professor.
# Define cluster_pvals; DO NOT CHANGE!
cluster_pvals <- function(k, km, myplot.df) {
# Inputs: k, km, myplot.df
# Returns: results (dataframe with clusters, pvalues, logodds)
# Set the p-value and logodds to 0
pvalue <- zeros(k,1)
logodds <- zeros(k,1)
results <- cbind.data.frame(cluster=1:k, pvalue, logodds)
classdisease <- zeros(k,1)
classall <- as.vector(table(km$cluster))
# use dplyr to calculate counts for each cluster
temp <- myplot.df %>%
dplyr::group_by(cluster) %>%
dplyr::count(name="freq") # Creates 'freq' column!
classdisease[temp$cluster] <- temp$freq
classlogodds <- zeros(k,2)
totaldisease <- sum(classdisease)
totalall <- sum(classall)
# Calculate the log odds ratio for the disease
for (i in 1:k) {
n11 <- classdisease[i] +1 # genes in disease in cluster i
n21 <- totaldisease- classdisease[i] +1 # genes in disease not in cluster i
n12 <- classall[i]-n11+1 # genes not in disease and in cluster i
n22 <- totalall- n11-n21 -n12+1; # genes not in disease and not in cluster
res <- fisher.test(matrix(c(n11,n21,n12,n22), 2, 2))
results[i,]$pvalue <- res$p.value
results[i,]$logodds<- log((n11*n22)/(n12*n21))
}
return(results)
}
Next we use the help function “cluster_pvals” to calculte the log odds ratio of Zika disease for each cluster.
# Apply cluster_pvals using the parameters just generated
clusters <- cluster_pvals(5, km, myplot.df)
# Helper function to determine enrichment
threshold <- 0.1 # Normally set to 0.1
enriched <- function(p.value,logodds,p.threshold=0.1) {
if ((p.value <= p.threshold) && (logodds > 0)) {
return(TRUE)
}
else {
return(FALSE)
}
}
# Evaluate across our results; create new column
clusters$enriched <- mapply(enriched, clusters$pvalue, clusters$logodds,threshold)
# rename the cluster name of the output matrix
clusters$cluster[clusters$cluster==2]<-"A"
clusters$cluster[clusters$cluster==4]<-"B"
clusters$cluster[clusters$cluster==1]<-"C"
clusters$cluster[clusters$cluster==5]<-"D"
clusters$cluster[clusters$cluster==3]<-"E"
# View results
kable(clusters)
| cluster | pvalue | logodds | enriched |
|---|---|---|---|
| C | 0.0026364 | -1.0289834 | FALSE |
| A | 0.6139669 | -0.1544676 | FALSE |
| E | 0.0958240 | -0.5265662 | FALSE |
| B | 0.0000000 | 1.5996365 | TRUE |
| D | 0.0400381 | -0.7896979 | FALSE |
From the result, we see the stage B is enriched. Thus, for Zika disease, stage B is the most susceptable period.
Same process as above, first we get the data point of Cognitive Disorder, and then make the biplot using these point. The PCA and cluster of data is from the mouse data
# get sample data
disease.df <- read.csv("datasets/Cognitive disorder_heat_map_data.csv",row.names = 1)
disease_symbols <- intersect(as.character(disease.df$symbol), as.character(rownames(Mouse.df)))
label <- as.factor(km$cluster)
levels(label) <- c('C', 'A', 'E', 'B', 'D')
plot.df <- cbind.data.frame(my.pca$x, cluster=label)
myplot.df <- plot.df[disease_symbols,]
# biplot
ggplot()+
geom_point(data = myplot.df,
aes(x=PC1, y=PC2, color=cluster) )+
geom_segment(x=0, y=0,
aes(xend=3*my.pca$rotation[,1], yend=3*my.pca$rotation[,2]),
arrow=arrow(length=unit(1/2,'picas')) )+
geom_text(aes(x=3.3*my.pca$rotation[,1],
y=3.3*my.pca$rotation[,2],
label=c("-8","-4","0","1","7","16","21","28")),
size=4)
Next we use the help function “cluster_pvals” provides in lab 5 handout and defines above to calculte the log odds ratio of Rett syndrome disease for each cluster.
# Apply cluster_pvals using the parameters just generated
clusters <- cluster_pvals(5, km, myplot.df)
# Helper function to determine enrichment
threshold <- 0.1 # Normally set to 0.1
enriched <- function(p.value,logodds,p.threshold=0.1) {
if ((p.value <= p.threshold) && (logodds > 0)) {
return(TRUE)
}
else {
return(FALSE)
}
}
# Evaluate across our results; create new column
clusters$enriched <- mapply(enriched, clusters$pvalue, clusters$logodds,threshold)
# rename the cluster name of the output matrix
clusters$cluster[clusters$cluster==2]<-"A"
clusters$cluster[clusters$cluster==4]<-"B"
clusters$cluster[clusters$cluster==1]<-"C"
clusters$cluster[clusters$cluster==5]<-"D"
clusters$cluster[clusters$cluster==3]<-"E"
# View results
kable(clusters)
| cluster | pvalue | logodds | enriched |
|---|---|---|---|
| C | 0.1679806 | -0.1574106 | FALSE |
| A | 0.0000000 | -0.6771831 | FALSE |
| E | 0.0000000 | 0.7325017 | TRUE |
| B | 0.0000031 | -0.5988984 | FALSE |
| D | 0.0003173 | 0.3992603 | TRUE |
The result shows that stages D and E are enriched, which is the most susceptible period for Cognitive Disorder. In human data, the last stage, upper layers, is enriched. Thus, the mice and human results are match: they both enriched at the last stage.
Plot the logodds & pvalue v.s. cluster of Cognitive Disorder for both human and mouse model. We observe that for both human and mouse, the logodds is a slight slop and pvalue is a “mountain” shape, which shows that the result are similar between human and mouse.
twoord.plot(lx = c(1, 2, 3, 4, 5),
ly = c(-0.6771831, -0.5988984, -0.1574106, 0.3992603, 0.7325017),
rx = c(1, 2, 3, 4, 5),
ry = c(0.0000000, 0.0000031, 0.1679806, 0.0003173, 0.0000000),
xlab="Cluster",
ylab="Logodds",
rylab="Pvalue",
main="Logodds & Pvalue of Cognitive Disorder for mouse model",
xticklab = c('A','B','C', 'D','E'),
lylim=c(min(clusters$logodds)-0.1, max(clusters$logodds)+0.1),
type=c('b','b')
)
twoord.plot(lx = c(1, 2, 3, 4, 5, 6),
ly = c(-0.4037, -0.3873, 0.1589, 0.1276, -0.02535, 0.4989),
rx = c(1, 2, 3, 4, 5, 6),
ry = c(0.0001, 0.0008, 0.1853, 0.2592, 0.9014, 0),
xlab="Cluster",
ylab="Logodds",
rylab="Pvalue",
main="Logodds & Pvalue of Cognitive Disorder for human model",
xticklab = c('Pluripotency','Neuroectoderm','Neural Differentiation',
'Cortical Specification','Deep Layers', 'Upper Layers'),
lylim=c(min(clusters$logodds)-0.1, max(clusters$logodds)+0.1),
type=c('b','b')
)
Zika virus will cause infants to be born with microcephaly, which is smaller brain size, if mother infected during pregnancy. Stage B is the most susceptable period of Zika.
Any disorder that cognitive function of a person is significantly affected, and normal functioning is impossible without treatment, will be cognitive disorders. Alzheimer’s disease is one of the most common cognitive disorders, and it has affectes approximately 5.1 million Americans. It will causes memeory loss and other cognitive problems. For human, the last stage upper layers is most susceptible to the cognitve disorders, and similar result been found in mouse model as well. For mouse, the stages D and E are most susceptiabel, which is also the last stages.
```